Microsoft’s Igor Rondel, Principal Development Manager of the Bing Index Quality team, published a post on the Bing Search blog titled “Web Spam Filtering.” In the post, Rondel shares how Bing detects, processes, and filters search spam out of its index.
Bing outlines several signals and methods it uses to discover and then filter such spam algorithmically. They include:
Content Quality
Bing assesses the quality of the content. Bing explains:
This itself is a huge and important concept that we’ll deep dive into in a future blog. At a high level, provided a spammer’s overarching goal is to drive ad and affiliate clicks, the content of the page is important only to the extent that it helps facilitate said goal. To put it another way, spammers generate content targeted at search engines and their algorithms, whereas legitimate SEOs generate content for their customers. The result is that, in most cases, spam pages have inadequate content with limited value to the user. We use this fact to facilitate detection. There are literally hundreds, if not thousands, of signals used to make this assessment, ranging from simple things like number of words on the page to more complex concepts of content uniqueness and utility.
Ad Location & Quantity
Bing looks at the ads present on a page. Bing explains:
Just about every page on the web contain ads. Presence of ads doesn’t make the page bad, let alone spam. What we care about are things like a) how many ads appear on the page, b) what type of ads (e.g. banner, grey-overs, pop-ups), and c) how intrusive/ disruptive they are.
Page Layout
Bing also looks at the position and layout of the information on the page. Bing explains:
Where is the main content located? Where are the ads located? Do the ads take up the prime real estate or are they neatly separated away from the main content (e.g. in the header/ footer or side pane)? Is it easy for users to mentally separate content from ads?
Spammers Use Content Generation Techniques
Bing explains that spammers use content generation techniques to quickly “maximize web presence” through mass content production by (a) copying others’ content (either entirely or with minor tweaks), (b) using programs to automatically generate page content, and (c) using external APIs to populate their pages with non-unique content. Bing counteracts these efforts by using “creative clustering algorithms” to detect them.
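Bing does not describe its clustering algorithms, so the following is only a minimal sketch of one common way to group copied or lightly tweaked pages: word shingling with Jaccard similarity and a greedy clustering pass. The function names, the shingle size `k`, and the `threshold` value are assumptions chosen for illustration, not Bing's method.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_near_duplicates(pages, threshold=0.5):
    """Greedily group pages whose shingle sets overlap heavily.

    pages: dict mapping a URL (or any page id) to its text.
    threshold: hypothetical similarity cutoff, not a value from Bing.
    Each page joins the first cluster whose representative it resembles.
    """
    clusters = []  # list of (representative shingle set, [page ids])
    for url, text in pages.items():
        page_shingles = shingles(text)
        for representative, members in clusters:
            if jaccard(representative, page_shingles) >= threshold:
                members.append(url)
                break
        else:
            clusters.append((page_shingles, [url]))
    return [members for _, members in clusters]


pages = {
    "a": "the quick brown fox jumps over the lazy dog today",
    "b": "the quick brown fox jumps over the lazy dog tonight",
    "c": "completely different text about cooking pasta at home",
}
print(cluster_near_duplicates(pages))  # [['a', 'b'], ['c']]
```

Pages "a" and "b" differ by one word and land in the same cluster, while "c" stays alone. A large cluster of near-identical pages spread across unrelated sites is the kind of pattern such a pass would surface for further review.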
Spammers Use Other Techniques To Boost Rankings
Bing adds that spammers use other methods, such as (a) stuffing the page body, URL, or anchors with keywords, (b) manipulating links via link farms, link networks, and forum post abuse, and (c) including hidden content on the page that is not meant for human consumption. To counteract these, Bing uses algorithms that look for content outliers across the web, so pages that look unnatural by comparison can be detected. For link manipulation, Bing may use its web graph (page and site inlinks and outlinks) to identify possible manipulation.
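To make the idea of a “content outlier” concrete, here is a hedged sketch of one simple check for keyword stuffing: measure how much of each page is taken up by its single most frequent word, then flag pages that sit far above the rest of the corpus. The function names, the median-plus-MAD rule, and the `k` and `min_density` values are illustrative assumptions; Bing has not published the signals or thresholds it uses.

```python
from collections import Counter
import statistics


def top_term_density(text):
    """Fraction of the page's words taken up by its most frequent word."""
    words = text.lower().split()
    if not words:
        return 0.0
    most_common_count = Counter(words).most_common(1)[0][1]
    return most_common_count / len(words)


def flag_density_outliers(pages, k=3.0, min_density=0.2):
    """Return page ids whose top-term density is unusually high.

    A page is flagged when its density exceeds both
    median + k * MAD (median absolute deviation) of the corpus
    and an absolute floor. k and min_density are hypothetical values.
    """
    densities = {url: top_term_density(text) for url, text in pages.items()}
    median = statistics.median(densities.values())
    mad = statistics.median(abs(d - median) for d in densities.values())
    cutoff = max(median + k * mad, min_density)
    return [url for url, d in densities.items() if d > cutoff]


pages = {
    "p1": "cheap flights to paris from london this summer",
    "p2": "how to bake sourdough bread at home easily",
    "p3": "a short guide to hiking the alps in spring",
    "p4": "reviews of the best laptops for students",
    "spam": "cheap cheap cheap cheap cheap cheap pills cheap",
}
print(flag_density_outliers(pages))  # ['spam']
```

The absolute floor keeps the check from flagging ordinary pages when the corpus is very uniform and the MAD is zero. A real system would combine many such signals rather than rely on one.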
Action Taken On Spam/Spammers
Bing takes different levels of action on the spam it finds, including (a) demoting the page, (b) neutralizing the effect of specific spam techniques, or (c) removing the page or site from the index altogether. The level of action depends on (a) the extent and egregiousness of the spam techniques involved and (b) the potential value the page presents to users.